31<sup>st</sup> IEEE International Symposium on Computer Arithmetic



# Combining Power and Arithmetic Optimization via Datapath Rewriting

Samuel Coward, Emiliano Morini, Theo Drane, George A. Constantinides

intel

Numerical Hardware Group, Intel Corporation EEE, Imperial College London Motivation



### Motivation

A∭ē



### Motivation

A∭G



# E-Graphs



7231

fastly

a = 1 \* a = (1 \* (1 \* a))

CVC5

- Compact representation of equivalent designs
- Maintains history
- Combine program analysis & rewriting
- Constructive rewrite application phase ordering



IP Engineering – GFx Numerical Hardware Group

#### E-Syn: E-Graph Rewriting with Technology-Aware Cost Functions for Logic Synthesis

Chen Chen\* HKUST(GZ) cchen099@connect.hkust-gz.edu.cn Guangyu Hu\* HKUST ghuae@connect.ust.hk Dongsheng Zuo HKUST(GZ) dzuo721@connect.hkust-gz.edu.cn

### Equality Saturation for Datapath Synthesis: A Pathway to Pareto Optimality

Ecenur Ustun<sup>1</sup>, Cunxi Yu<sup>2</sup>, and Zhiru Zhang<sup>1</sup>

### **ESFO: Equality Saturation for FIRRTL Optimization**

Yan Pi

Hongji Zou

Tun Li

#### There and Back Again: A Netlist's Tale with Much Egraphin'

Gus Henry Smith<sup>†</sup>, Zachary D. Sisco<sup>‡</sup>, Thanawat Techaumnuaiwit<sup>‡</sup>, Jingtao Xia<sup>‡</sup>, Vishal Canumalla<sup>†</sup>, Andrew Cheung<sup>†</sup>, Zachary Tatlock<sup>†</sup>, Chandrakana Nandi<sup>§</sup>, Jonathan Balkind<sup>‡</sup>



#### ROVER – RTL Opt. via Verified E-Graph Rewriting а С Datapath New: power Extraction CSA optimization rewrites Looks like 2 CPA Just1CSA Rewrite **ROVER** Front-End **ROVER Back-End** Extraction eqq Input Optimized

E-Graph

VeriLang

#### Measuring Power

Verilog



VeriLang

Verilog

A∭ē

### ROVER - RTL Opt. via Verified E-Graph Rewriting



A∭G

### Don't Cares – hardware is not software

- Muxes introduce potential redundancy
- Redundant computations = wasted power

DAC 2023 – exploit don't care for performance

$$a?f(x):g(y) \rightarrow a?f'(x):g(y)$$

Such that:

 $f'(x) = \begin{cases} f(x) & \text{if } a \\ don't \text{ care else} \end{cases}$ 



### Redundant Computation & Wasted Watts





### Redundant Computation & Wasted Watts



A∭e

## 1) Data Gating/Operand Isolation – Zeroing Datapath

- Create a mask from select signal
- Bitwise AND with datapath inputs





# 1) Data Gating/Operand Isolation – Zeroing Datapath

- Create a mask from select signal
- Bitwise AND with datapath inputs

| Left                                        | Right                                         |
|---------------------------------------------|-----------------------------------------------|
| s ? b : c                                   | $s?(b\&\{w_b\{s\}\}):(c\&\{w_c\{\bar{s}\}\})$ |
| $(a \ op \ b) \& \{w_o\{s\}\}$              | $(a \& \{w_o\{s\}\}) op (b \& \{w_b\{s\}\})$  |
| $(a \& \{w_a \{s_1\}\}) \& \{w_a \{s_2\}\}$ | $(a \& \{w_a \{s_1 \& s_2\}\})$               |







Combo

FF

FF

В



S



# 1) Data Gating/Operand Isolation – Zeroing Datapath

- Create a mask from select signal
- Bitwise AND with datapath inputs

| Left                                        | Right                                         |
|---------------------------------------------|-----------------------------------------------|
| s ? b : c                                   | $s?(b\&\{w_b\{s\}\}):(c\&\{w_c\{\bar{s}\}\})$ |
| $(a \ op \ b) \& \{w_o\{s\}\}$              | $(a \& \{w_o\{s\}\}) op (b \& \{w_b\{s\}\})$  |
| $(a \& \{w_a \{s_1\}\}) \& \{w_a \{s_2\}\}$ | $(a \& \{w_a \{s_1 \& s_2\}\})$               |

#### Considerations:

- Toggle rate of select signal S
- Toggle rate of data signals A,B
- Timing impact & gate count



# 2) Transparent Latch/Register – Operand Isolation

- When enabled is transparent
- When disabled input is "frozen" no power consumed in datapath





## 2) Transparent Latch/Register – Operand Isolation

- When enabled is transparent
- When disabled input is "frozen" no power consumed in datapath





# 2) Transparent Latch/Register – Operand Isolation

- When enabled is transparent
- When disabled input is "frozen" no power consumed in datapath





| Left            | Right                                 |  |  |
|-----------------|---------------------------------------|--|--|
| s?b:c           | $s$ ? $TREG(b,s)$ : $TREG(c,\bar{s})$ |  |  |
| TREG(a op b, s) | $TREG(a, s) \ op \ TREG(b, s)$        |  |  |
| REG(a, en)      | REG(TREG(a, en), en)                  |  |  |

#### **Considerations:**

- Clock gating available?
- Timing impact & gate count

# 3) Clock Gating – Switching off Datapath

- Modify register enable signals
- Construct observability don't care conditions



#### Siemens White Paper



# 3) Clock Gating – Switching off Datapath

- Modify register enable signals
- Construct *observability don't care* conditions



With clock gating Combinational analysis Sequential analysis

#### $TREG(REG(A, en), REG(B, en)) \rightarrow REG(A, en \& B)$







#### Siemens White Paper



# 3) Clock Gating – Switching off Datapath

- Modify register enable signals
- Construct *observability don't care* conditions



#### Siemens White Paper

en  $B \rightarrow 1$  Q  $A \rightarrow 0$   $A \rightarrow$ 

# 

^Gated Reg^

#### **Considerations:**

• Availability of clock gating signals

 $TREG(REG(A, en), REG(B, en)) \rightarrow REG(A, en \& B)$ 

• Timing impact & gate count



A∭ē

### Combining Power and <u>Arithmetic</u> Optimization

- Arithmetic rewrites associativity, carry-save
- Logic rewrites mux tree rearrangement
- Logic arithmetic exchange

# All rewrites applied and combined during exploration



### Computationally Efficient E-Graph Simulation



#### Extraction:

- Nodes in an e-class use more/less energy
- Estimate node cost based on switching activities and operator area
- Extract best using ILP

Untapped simulation/ testing opportunities?



Intel

Datapath Extraction Aware Case Study





Datapath Extraction Aware Case Study



 $r_0$ 

 $m_2$ 

### Results

#### Up to 34% less power, 24% on average, 5% average area overhead





### Results

#### Up to 34% less power, 24% on average, 5% average area overhead





#### UNREACHABLE!!!

• Resource sharing -not expressible as local equivalence preserving rewrites...



# Backup



### Results

| Benchmark              | Nadaa | Baseline         |                 | Area Optimized   |                 | Power Optimized  |                 |
|------------------------|-------|------------------|-----------------|------------------|-----------------|------------------|-----------------|
| Delicimark             | Nodes | Area $(\mu m^2)$ | Power $(\mu W)$ | Area $(\mu m^2)$ | Power $(\mu W)$ | Area $(\mu m^2)$ | Power $(\mu W)$ |
| Comb. Mux Add Tree     | 20    | 32.9             | 98.2            | 32.8 (- 0.4%)    | 98.8 (+ 0.5%)   | 31.0(- 7.4%)     | 83.2 (-15.5%)   |
| Address Generation     | 22    | 58.5             | 421.9           | 57.1 (- 0.2%)    | 419.2 (-0.6%)   | 57.2 (+ 2.2%)    | 301.2 (-28.7%)  |
| Weight Calculation     | 81    | 51.6             | 1141.4          | 46.4 (-10.2%)    | 1072.3 (-6.1%)  | 53.3 (+ 3.2%)    | 871.5 (-23.7%)  |
| Pipe. Mux Add Tree [9] | 23    | 38.6             | 852.3           | 38.6 ( 0.0%)     | 852.3 ( 0.0%)   | 44.1 (+14.9%)    | 615.3 (-27.2%)  |
| Dual Op ALU [10]       | 17    | 6.5              | 186.9           | 6.5 ( 0.0%)      | 186.9 ( 0.0%)   | 7.5 (+15.1%)     | 146.8 (-21.3%)  |
| Sequential Reg [13]    | 13    | 12.4             | 579.6           | 12.4 ( 0.0%)     | 579.6 ( 0.0%)   | 12.8 (+ 2.9%)    | 383.0 (-33.9%)  |
| Dual Path FP Sub [4]   | 62    | 27.8             | 1097.1          | 27.8 ( 0.0%)     | 1097.1 ( 0.0%)  | 29.0 (+ 3.5%)    | 929.4 (-14.9%)  |





## **Clock Gate Proof**

$$\begin{split} L_i &= \texttt{TREG}(\texttt{REG}(a_i, en_i), \texttt{REG}(b_i, en_i)) \text{ and } \\ R_i &= \texttt{REG}(a_i, en_i \& b_i) \end{split}$$

for all clock cycles i via induction. First, let

$$p_i = \operatorname{REG}(a_i, en_i)$$
 and  $q_i = \operatorname{REG}(b_i, en_i)$ .

Suppose  $\forall i \leq k \ L_i = R_i$ , then if  $en_k = 1$ :

$$q_{k+1} = b_k \qquad p_{k+1} = a_k$$
  

$$L_{k+1} = q_{k+1} ? p_{k+1} : L_k$$
  

$$= b_k ? a_k : L_k$$

Then, since  $en_k = 1$  and  $R_k = L_k$ ,

$$R_{k+1} = en_k \& b_k ? a_k : R_k$$
  
=  $b_k ? a_k : R_k = L_{k+1}$ 

Now if  $en_k = 0$ , then  $R_{k+1} = R_k = L_k$  and

$$q_{k+1} = q_k \qquad p_{k+1} = p_k$$

$$L_{k+1} = q_k ? p_k : L_k$$

$$q_k = 1 \Rightarrow L_k = q_k ? p_k : L_{k-1} = p_k$$

Therefore  $L_{k+1} = L_k = R_{k+1}$  and hence  $R_{k+1} = L_{k+1}$  for all values of  $en_k$ . Under the zero register initialization assumption it is trivial to prove  $L_0 = R_0$ .

